In [1]:
# limit GPU usage, if any, to this GPU
%env CUDA_VISIBLE_DEVICES=0


env: CUDA_VISIBLE_DEVICES=0

In [2]:
import numpy as np
from classifier import common
import os
labels = common.fetch_samples()

from sklearn.model_selection import train_test_split
np.random.seed(123)
y_train, y_test, sha256_train, sha256_test = train_test_split(
    list(labels.values()), list(labels.keys()), test_size=1000)

MalwaResNet

Clearly, my model needs a heavy dose of special fanciness. We'll turn to the ResNet architecture! This builds on the end-to-end model in the previous notebook, but stacks things higher and deeper, thanks to our hand-crafted residual cell.

This model may require a hefty GPU. I ran it successfully on a single TITAN X (Pascal). If yours doesn't work, try the lite=True option in create_model.

You can find code that defines the MalwaResNet model architecture at classifier/malwaresnet.py.
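
For a rough idea of the pattern, here's a minimal sketch of a 1D residual cell in Keras, assuming a plain post-activation block; the real cell in classifier/malwaresnet.py differs in its particulars.

In [ ]:
# minimal sketch of a 1D residual cell -- NOT the code in malwaresnet.py
# assumes x already has `filters` channels so the identity shortcut lines up;
# otherwise the shortcut would need a 1x1 convolution to match shapes
from keras.layers import Conv1D, BatchNormalization, Activation, add

def residual_cell(x, filters, kernel_size=3):
    shortcut = x
    y = Conv1D(filters, kernel_size, padding='same')(x)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv1D(filters, kernel_size, padding='same')(y)
    y = BatchNormalization()(y)
    y = add([y, shortcut])  # the skip connection that gives ResNet its name
    return Activation('relu')(y)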


In [3]:
# for this demo, we'll slurp in only the first 256K (2**18) bytes of each file
max_file_length = int(2**18)
file_chunks = 16  # break file into this many chunks
file_chunk_size = max_file_length // file_chunks
batch_size = 4
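
To make the layout concrete: each file becomes a (file_chunks, file_chunk_size) grid of bytes. A quick sanity check of the arithmetic with dummy data (illustration only; the real file reading and padding happen inside common.get_file_data):

In [ ]:
# sanity-check the chunking arithmetic with dummy bytes (illustration only)
dummy = np.zeros(max_file_length, dtype=np.uint8)  # stand-in for file contents
chunked = dummy.reshape(file_chunks, file_chunk_size)
print(chunked.shape)  # (16, 16384): 16 chunks of 16 KiB each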

In [ ]:
# Note that this is a very long-running cell, and
# it may appear that the output is truncated before training completes

# let's train this puppy
from classifier import malwaresnet
import math
from keras.callbacks import LearningRateScheduler, EarlyStopping, ModelCheckpoint

# create_model(input_shape, byte_embedding_size=2, lite=False)
model_malwaresnet = malwaresnet.create_model(input_shape=(file_chunks, file_chunk_size), byte_embedding_size=2)
train_generator = common.generator(list(zip(sha256_train, y_train)), batch_size, file_chunks, file_chunk_size)
test_generator = common.generator(list(zip(sha256_test, y_test)), 1, file_chunks, file_chunk_size)
training_history = model_malwaresnet.fit_generator(train_generator,
                        steps_per_epoch=math.ceil(len(sha256_train) / batch_size),
                        epochs=20,
                        callbacks=[
                            EarlyStopping(patience=1),
                            ModelCheckpoint('malwaresnet.h5', save_best_only=True),
                            LearningRateScheduler(lambda epoch: common.schedule(epoch, start=0.1, decay=0.5, every=1))],
                        validation_data=test_generator,
                        validation_steps=len(sha256_test))


Using TensorFlow backend.
learning rate = 0.1
Epoch 1/20
24750/24750 [==============================] - 30939s - loss: 1.8194 - acc: 0.6434 - val_loss: 1.0777 - val_acc: 0.7020
learning rate = 0.05
Epoch 2/20
24750/24750 [==============================] - 31095s - loss: 0.9919 - acc: 0.6874 - val_loss: 0.8848 - val_acc: 0.6950
learning rate = 0.025
Epoch 3/20
24750/24750 [==============================] - 31509s - loss: 0.8041 - acc: 0.7172 - val_loss: 0.7376 - val_acc: 0.7740
learning rate = 0.00625
Epoch 5/20
24750/24750 [==============================] - 31157s - loss: 0.7655 - acc: 0.7404 - val_loss: 0.6755 - val_acc: 0.8070
learning rate = 0.003125
Epoch 6/20
24750/24750 [==============================] - 31124s - loss: 0.7669 - acc: 0.7300 - val_loss: 0.6939 - val_acc: 0.7580
learning rate = 0.0015625
Epoch 7/20
 6591/24750 [======>.......................] - ETA: 22762s - loss: 0.7473 - acc: 0.7501
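
One note on the log: the learning rates halve each epoch (0.1, 0.05, 0.025, ...; the epoch-4 line appears to have been dropped from the captured output). That's consistent with a simple step decay, sketched below as a guess at what common.schedule in classifier/common.py computes (the real helper presumably also prints the "learning rate = ..." lines):

In [ ]:
# hedged sketch of a step-decay schedule consistent with the logged rates;
# the actual implementation lives in classifier/common.py
def schedule(epoch, start=0.1, decay=0.5, every=1):
    # multiply the starting rate by `decay` every `every` epochs: 0.1, 0.05, 0.025, ...
    return start * decay ** (epoch // every)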

Okay, no more snarkiness about impatient millennials from me. Each epoch is taking about 31,000 s ≈ 517 min ≈ 8 hours 37 minutes. But we're going to remain optimistic that this is going to be awesome. With a name like MalwaResNet, how could it not be?


In [4]:
from keras.models import load_model
# load the "best" model saved by our ModelCheckpoint callback -- in this
# case the penultimate model, which is not much better than the final
# model we already have in hand
model_malwaresnet = load_model('malwaresnet.h5')
y_pred = []
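# score the test set one file at a time, reshaping each file's bytes into
# the (file_chunks, file_chunk_size) layout the model expects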
for sha256, lab in zip(sha256_test, y_test):
    y_pred.append(
        model_malwaresnet.predict_on_batch(
            np.asarray([common.get_file_data(sha256, lab, max_file_length)]).reshape(
                (-1, file_chunks, file_chunk_size))
        )
    )
common.summarize_performance(np.asarray(y_pred).flatten(), y_test, "End-to-end convnet")


Using TensorFlow backend.
** End-to-end convnet **
ROC AUC = 0.8613780511243514
threshold=0.8300181031227112: 0.08471074380165289 TP rate @ 0.009689922480620155 FP rate
confusion matrix @ threshold:
[[512   4]
 [443  41]]
accuracy @ threshold = 0.553
Out[4]:
(0.8613780511243514,
 0.8300181,
 0.0096899224806201549,
 0.084710743801652888,
 array([[512,   4],
        [443,  41]]),
 0.55300000000000005)
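
For reference, the returned tuple is (ROC AUC, threshold, FP rate, TP rate, confusion matrix, accuracy). A minimal sketch of how such numbers fall out of scikit-learn, assuming (my guess, not necessarily what common.summarize_performance does) that the threshold targets roughly a 1% false-positive rate:

In [ ]:
# sketch of the metrics reported above via scikit-learn; how
# summarize_performance actually picks its threshold is a guess here
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, accuracy_score

def summarize(y_true, y_score, target_fpr=0.01):
    auc = roc_auc_score(y_true, y_score)
    fpr, tpr, thr = roc_curve(y_true, y_score)
    i = np.searchsorted(fpr, target_fpr, side='right') - 1  # last point at or under target FPR
    y_hat = (np.asarray(y_score) >= thr[i]).astype(int)
    return (auc, thr[i], fpr[i], tpr[i],
            confusion_matrix(y_true, y_hat), accuracy_score(y_true, y_hat))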

So much innovation! So much GPU heat to the universe! For this?!

(┛◉Д◉)┛彡┻━┻

For real, though. Even if MalwaResNet were a killer architecture for malware, it'd need to be trained on a lot more than a paltry 100K malicious/benign samples, and the optimization and architecture would need some revamping (for example, ResNet is trained over hundreds of epochs...on a metric ton of relatively small images). But as it is, this special model isn't really even close in performance to our not special multilayer perceptron. (It's probable that if I let this train for another few weeks, we'd see some modest improvement...but I don't see it getting near 0.99+ AUC.)

Let's wrap this up

Alright, the point of this exercise has been to demonstrate that

  1. Your malware model isn't that special.
  2. Well, it might be special, but it probably isn't just your fancy architecture that makes it so. The first notebook in this series shows a not special model.
  3. Static malware detection isn't an "easy win" application for end-to-end deep learning models (e.g., ResNet) that were originally developed for image recognition. Furthermore, our training set in these notebooks is much too small for the expressive end-to-end models with millions of parameters.
  4. End-to-end deep learning for static malware classification is a cool idea, but requires some work and perhaps more research to beat out not special models with great data, great features, and meticulously inspected/corrected labels.
  5. If you're looking for a quick win, go with the "shallow" model that isn't that special. And do the work required to make its training data (and feature set) great.
  6. If you're a researcher looking to advance the state of the art in end-to-end deep learning for static malware detection (no parsing), let's chat about creative ideas, layers and "gadgets" that might help get us there.